Data Gathering

To see all raw data gathered click here

Quarterly and Annual Ridership Totals by Mode of Transportation 1

The first dataset comes from the American Public Transportation Association (APTA) and serves as an introductory overview of public transit ridership over time. It gives a broad view of quarterly ridership across the entire country from 1990 onward, which sets the stage for the problem we intend to explore.

The raw data and methodology for how it was obtained can be found using this link: https://www.apta.com/research-technical-resources/transit-statistics/ridership-report/

The data itself can be downloaded using this link: https://www.apta.com/wp-content/uploads/APTA-Ridership-by-Mode-and-Quarter-1990-Present.xlsx

To download this data, I used the httr package in R, which saves the Excel workbook to disk; the readxl package then reads it into a data frame. Below is the code for this action and a screenshot of the raw data to illustrate its form upon download:

Code
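library(readxl)
library(httr)
# Direct link to the APTA ridership workbook
url1 <- 'https://www.apta.com/wp-content/uploads/APTA-Ridership-by-Mode-and-Quarter-1990-Present.xlsx'
# Download the workbook to a temporary file inside ../data
tf <- tempfile(pattern = "APTA-Ridership-by-Mode-and-Quarter-1990-Present", fileext = ".xlsx", tmpdir = "../data")
GET(url1, write_disk(tf))
# Read the second sheet of the workbook and inspect its structure
df <- read_excel(tf, sheet = 2)
str(df)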

Quarterly and Annual Ridership Totals by Mode of Transportation

News API Data 2

An essential part of understanding public perception of a topic is assessing how it is covered in the news. Coverage often informs general opinions and can introduce conversations that had not previously been in the zeitgeist. Thus, this paper will analyze text data from https://newsapi.org/ to study news coverage of two distinct public transit systems.

For this project, I will be looking at data regarding the Washington Metropolitan Area Transit Authority (WMATA) and the Bay Area Rapid Transit (BART). Both of these transit systems have several advantages for academic study: they are both large networks with rich histories and connections to their respective cities, there exist robust data sources allowing us to analyze information from several angles, and comparing them will allow us to get perspectives on differences between cities on opposite coasts.

The following shows how I accessed News API with Python. Each request saves the response as a raw JSON file, the start of which is included below each code block to show the nature of the data prior to cleaning.

Code
import json
import os

import requests

baseURL = "https://newsapi.org/v2/everything?"

# Read the key from the environment instead of hard-coding it in a shared file.
# NEWS_API_KEY is an assumed variable name; set it before running.
API_KEY = os.environ['NEWS_API_KEY']
TOPIC = 'wmata'

# Query parameters: '+' marks the term as required; results are sorted by relevancy
URLpost = {'apiKey': API_KEY,
            'q': '+'+TOPIC,
            'sortBy': 'relevancy'}

response = requests.get(baseURL, params=URLpost)
response = response.json()

# Save the raw response for later cleaning
with open('WMATA-newapi-raw-data.json', 'w') as outfile:
    json.dump(response, outfile, indent=4)

Raw JSON data from News API; Topic: WMATA
Code
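import json
import os

import requests

baseURL = "https://newsapi.org/v2/everything?"

# NEWS_API_KEY is an assumed environment variable name; set it before running
API_KEY = os.environ['NEWS_API_KEY']
TOPIC = 'Bay Area Rapid Transit'

# Same query as for WMATA, with the BART topic. The topic is left unquoted, as originally run;
# News API also accepts a phrase wrapped in double quotes for an exact match.
URLpost = {'apiKey': API_KEY,
            'q': '+'+TOPIC,
            'sortBy': 'relevancy'}

response = requests.get(baseURL, params=URLpost)
response = response.json()

# Save the raw response for later cleaning
with open('BART-newapi-raw-data.json', 'w') as outfile:
    json.dump(response, outfile, indent=4)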

Raw JSON data from News API; Topic: Bay Area Rapid Transit
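
To inspect a saved response as a table before cleaning, something like the following sketch may help. It assumes the file follows News API's usual response shape, with a top-level "articles" list whose entries hold a nested "source" object alongside fields such as "title" and "publishedAt". The function name articles_to_frame and the sample record are illustrative only and are not part of the original pipeline.

```python
import pandas as pd


def articles_to_frame(response):
    """Flatten the 'articles' list of a News API style response into a DataFrame."""
    articles = response.get("articles", [])
    # json_normalize expands the nested 'source' dict into 'source.id' and 'source.name'
    return pd.json_normalize(articles)


# Made-up record in the assumed shape, used only to show the resulting columns
sample = {
    "status": "ok",
    "totalResults": 1,
    "articles": [
        {
            "source": {"id": None, "name": "Example News"},
            "title": "Metro ridership update",
            "description": "A placeholder description.",
            "publishedAt": "2023-10-01T12:00:00Z",
        }
    ],
}

frame = articles_to_frame(sample)
print(frame.columns.tolist())
```

For the real data, the dictionary returned by json.load on WMATA-newapi-raw-data.json or BART-newapi-raw-data.json would be passed in place of the sample.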

Ridership by Hour

In addition to the volume of public transit usage, we can glean information on the purpose of that usage by analyzing ridership by hour of the day. High peaks during “rush hour” likely indicate that work commuting strongly influences the data. For this reason, I downloaded WMATA ridership-by-hour data sets for the periods before and after the onset of the pandemic, to view this relationship and whether it has changed under the new circumstances. March 17, 2020 is chosen as the demarcation date, as that was the day on which the first social distancing precautions were announced in Washington, D.C. The data is shown below:

Hourly Ridership from 1/1/2018 to 3/17/2020

Hourly Ridership from 3/18/2020 to 10/5/2023
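
The split at the demarcation date and the hour-of-day comparison can be sketched as follows. This is a minimal illustration, not the code used to produce the files above: the column names "date", "hour", and "entries" and the toy rows are assumptions, since the layout of the downloaded WMATA files is not shown here.

```python
import pandas as pd

# March 17, 2020 is the demarcation date named in the text
CUTOFF = pd.Timestamp("2020-03-17")


def split_by_cutoff(df, date_col="date"):
    """Return (pre, post): rows dated on or before CUTOFF, and rows dated after it."""
    dates = pd.to_datetime(df[date_col])
    return df[dates <= CUTOFF], df[dates > CUTOFF]


def hourly_profile(df, hour_col="hour", riders_col="entries"):
    """Average ridership for each hour of the day."""
    return df.groupby(hour_col)[riders_col].mean()


# Made-up rows in an assumed layout: one row per date and hour
toy = pd.DataFrame({
    "date": ["2020-03-16", "2020-03-16", "2020-03-18", "2020-03-18"],
    "hour": [8, 13, 8, 13],
    "entries": [900, 300, 120, 100],
})

pre, post = split_by_cutoff(toy)
print(hourly_profile(pre))
print(hourly_profile(post))
```

A pronounced morning peak in the first profile that flattens in the second would be consistent with a reduced share of commuting trips.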

Ridership by Demographic 6

In answering the question of whether public transit’s role as a public service should be the paramount consideration in judging its efficacy, it is important to understand that transit often disproportionately serves underprivileged groups. By analyzing demographic data, we can gather insights on who benefits most from robust public transit systems. To address this, I use the U.S. Census Bureau’s 2021 5-year estimates of means of transportation to work by selected characteristics. The raw data is shown below:

2021 5-Year Estimate of Transportation Means by Demographic

Commute to Work by Demographic 7

Similar to the previous dataset, this one further addresses the idea that public transit affects different groups of people to varying extents. The source is a survey dataset from ipums.org, which hosts millions of survey responses collected by the U.S. Census Bureau. The main reason for obtaining this data is the presence of a “Means of transportation to work” field, which will serve as our labels. Additionally, various demographic fields can be used to perform Naive Bayes classification later on. This data was obtained by submitting a download request for 2021 Census data with the following fields:

2021 Survey Fields
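
Separating the label field from the demographic fields might look like the sketch below. TRANWORK is the IPUMS USA variable for means of transportation to work; AGE, SEX, and INCTOT stand in for whichever fields were actually requested, and the rows and label codes are made up. Treating code 0 as "not applicable" is an assumption that should be confirmed against the IPUMS codebook.

```python
import pandas as pd

LABEL_COL = "TRANWORK"  # IPUMS variable for means of transportation to work


def split_labels(df, feature_cols, label_col=LABEL_COL):
    """Return (X, y), dropping rows whose label is missing or coded 0."""
    # Assumption: code 0 marks not-applicable respondents; confirm against the codebook
    kept = df[df[label_col].notna() & (df[label_col] != 0)]
    return kept[feature_cols], kept[label_col]


# Made-up rows; the feature columns are placeholders for the requested fields
toy = pd.DataFrame({
    "AGE": [34, 51, 8, 27],
    "SEX": [1, 2, 1, 2],
    "INCTOT": [52000, 61000, 0, 38000],
    "TRANWORK": [10, 36, 0, 31],
})

X, y = split_labels(toy, ["AGE", "SEX", "INCTOT"])
print(X.shape, y.tolist())
```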

WMATA and BART Yelp Reviews 8 9

Gauging public sentiment regarding public transit systems can be a great way to analyze the relationship between a system and the residents of its city. Regardless of external factors, consumer dissatisfaction with a mode of transportation could greatly influence its usage when other methods are readily available to many. Thus, these datasets will feature Yelp reviews of both the WMATA and BART systems, including the date of review, the exact text, and the associated numerical rating (1-5 stars). Gathering labeled text data will be invaluable for Naive Bayes classification in the future. To accomplish this, I will use the BeautifulSoup package in Python, which facilitates web scraping by parsing a page's HTML. Since the reviews span several pages, it is necessary to iterate over each page to obtain every review available to us. The code for this is below, along with screenshots of the raw data after being collated into dataframes.

Code
import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.yelp.com/biz/wmata-washington'
num_rating = []
review_date = []
review_text = []

# The first page has no offset; later pages use ?start=10 through ?start=90
for start in range(0, 100, 10):
    url = base_url if start == 0 else base_url + '?start=' + str(start)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Locate the section holding the recommended reviews; skip the page if it is missing
    heading = soup.find(string='Recommended Reviews')
    if heading is None:
        continue
    reviews = heading.find_parent('section')
    # Each review starts at a div whose aria-label ends in "star rating"
    for review in reviews.select('div[aria-label$="star rating"]'):
        num_rating.append(review['aria-label'])
        review_date.append(review.find_next('span').text)
        text = review.find_next('span', lang=True)
        review_text.append("NA" if text is None else text.text)

# Collate the three lists into a dataframe and save the raw reviews
wmata_reviews = pd.DataFrame(list(zip(num_rating,review_date,review_text)))
wmata_reviews.to_csv('../data/wmata_reviews.csv')

WMATA Yelp Reviews
Code
base_url1 = 'https://www.yelp.com/biz/bart-bay-area-rapid-transit-oakland-2'
num_rating1 = []
review_date1 = []
review_text1 = []

# The first page has no offset; later pages use ?start=10 through ?start=1000
for start in range(0, 1010, 10):
    url = base_url1 if start == 0 else base_url1 + '?start=' + str(start)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Skip pages without a reviews section, so reviews from an earlier page are not appended again
    hold1 = soup.find(string='Recommended Reviews')
    if hold1 is None:
        continue
    reviews = hold1.find_parent('section')
    # Each review starts at a div whose aria-label ends in "star rating"
    for review in reviews.select('div[aria-label$="star rating"]'):
        num_rating1.append(review['aria-label'])
        review_date1.append(review.find_next('span').text)
        hold = review.find_next('span', lang=True)
        if hold is None:
            review_text1.append("NA")
        else:
            review_text1.append(hold.text)

# Collate the three lists into a dataframe and save the raw reviews
bart_reviews = pd.DataFrame(list(zip(num_rating1,review_date1,review_text1)))
bart_reviews.to_csv('../data/bart_reviews.csv')

BART Yelp Reviews
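
The scraped rating is stored as the full aria-label text rather than a number, so a later cleaning step will likely need to convert it. The sketch below shows one way to do that, assuming labels of the form "5 star rating" (the selector above matches labels ending in "star rating"). The helper name parse_star_rating, the column names, and the toy rows are illustrative, not taken from the scraped data.

```python
import re

import pandas as pd

# Matches labels such as "5 star rating" or "4.5 star rating"
RATING_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+star rating\s*$")


def parse_star_rating(label):
    """Return the numeric rating from an aria-label, or None if it does not match."""
    match = RATING_PATTERN.match(label)
    return float(match.group(1)) if match else None


# Made-up rows in the same three-column layout as the scraped dataframes
toy = pd.DataFrame(
    [("5 star rating", "11/2/2023", "Clean and on time."),
     ("1 star rating", "10/30/2023", "Long delays again.")],
    columns=["rating_label", "review_date", "review_text"],
)
toy["rating"] = toy["rating_label"].map(parse_star_rating)
print(toy[["rating", "review_date"]])
```

Returning None for unmatched labels keeps malformed rows visible for inspection instead of silently dropping them.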

Footnotes

  1. “Ridership Report.” American Public Transportation Association, 21 Sept. 2023, www.apta.com/research-technical-resources/transit-statistics/ridership-report/.

  2. “News API – Search News and Blog Articles on the Web.” News API – Search News and Blog Articles on the Web, newsapi.org/. Accessed 12 Oct. 2023.

  3. Barrero, Jose Maria, et al. Why Working from Home Will Stick, 2021, https://doi.org/10.3386/w28731.

  4. “Washington Metropolitan Area Transit Authority.” WMATA, www.wmata.com/initiatives/ridership-portal/. Accessed 12 Oct. 2023.

  5. “Ridership Reports.” Ridership Reports | Bay Area Rapid Transit, www.bart.gov/about/reports/ridership. Accessed 13 Oct. 2023.

  6. U.S. Census Bureau. “Means of Transportation to Work by Selected Characteristics.” American Community Survey, ACS 5-Year Estimates Subject Tables, Table S0802, 2021, https://data.census.gov/table/ACSST5Y2021.S0802?t=Commuting&g=860XX00US20020,20032. Accessed 12 Oct. 2023.

  7. Steven Ruggles, Sarah Flood, Matthew Sobek, Danika Brockman, Grace Cooper, Stephanie Richards, and Megan Schouweiler. IPUMS USA: Version 13.0 [dataset]. Minneapolis, MN: IPUMS, 2023. https://doi.org/10.18128/D010.V13.0

  8. “WMATA - Washington, DC, DC.” Yelp, www.yelp.com/biz/wmata-washington. Accessed 2 Nov. 2023.

  9. “Bart - Bay Area Rapid Transit - Oakland, CA.” Yelp, www.yelp.com/biz/bart-bay-area-rapid-transit-oakland-2. Accessed 2 Nov. 2023.